These data appeared in the Wall Street Journal. The advertisements were selected by an annual survey conducted by Video Board Tests, Inc., a New York ad-testing company, based on interviews with 20,000 adults who were asked to name the most outstanding TV commercial they had seen, noticed, and liked. The retained impressions were based on a survey of 4,000 adults, in which regular product users were asked to cite a commercial they had seen for that product category in the past week. There are 21 observations on three variables.
Variables:
Firm - the name of the firm
Spend - TV advertising budget, 1983 ($ millions)
MilImp - millions of retained impressions per week
These data are saved in a text file (.txt) called TVads, which can be imported into R as shown in the following code. The path to the location where the text file is saved must be changed accordingly.
TVads <- read.table("TVads.txt", header= TRUE)
head(TVads) # This will print only the first 6 rows of data file
## Firm Spend MilImp
## 1 MILLER.LITE 50.1 32.1
## 2 PEPSI 74.1 99.6
## 3 STROH'S 19.3 11.7
## 4 FEDERAL.EXPRESS 22.9 21.9
## 5 BURGER.KING 82.4 60.8
## 6 COCO-COLA 40.1 78.6
TVads # This will print the complete data file in the Console Window
## Firm Spend MilImp
## 1 MILLER.LITE 50.1 32.1
## 2 PEPSI 74.1 99.6
## 3 STROH'S 19.3 11.7
## 4 FEDERAL.EXPRESS 22.9 21.9
## 5 BURGER.KING 82.4 60.8
## 6 COCO-COLA 40.1 78.6
## 7 MC.DONALD'S 185.9 92.4
## 8 MCI 26.9 50.7
## 9 DIET.COLA 20.4 21.4
## 10 FORD 166.2 40.1
## 11 LEVI'S 27.0 40.8
## 12 BUD.LITE 45.6 10.4
## 13 ATT.BELL 154.9 88.9
## 14 CALVIN.KLEIN 5.0 12.0
## 15 WENDY'S 49.7 29.2
## 16 POLAROID 26.9 38.0
## 17 SHASTA 5.7 10.0
## 18 MEOW.MIX 7.6 12.3
## 19 OSCAR.MEYER 9.2 23.4
## 20 CREST 32.4 71.1
## 21 KIBBLES.N.BITS 6.1 4.4
# The following plot function creates the scatterplot shown in Figure 1 below
plot(TVads$Spend, TVads$MilImp, pch=16, xlab = "TV advertising budget, 1983 ($ millions) ", ylab = "Millions of retained impressions per week", main="Figure 1: Scatterplot of Millions of retained impressions per week
and TV advertising budget", cex.main=0.9)
Simple linear regression analysis in R
model <- lm(MilImp ~ Spend, data = TVads) # The lm function is used to fit simple linear regression model
model# This displays only the values of estimated regression coefficients
##
## Call:
## lm(formula = MilImp ~ Spend, data = TVads)
##
## Coefficients:
## (Intercept) Spend
## 22.1627 0.3632
summary(model) # The summary function is used to get more information about the fitted regression model #
##
## Call:
## lm(formula = MilImp ~ Spend, data = TVads)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.422 -12.623 -8.171 8.832 50.526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.16269 7.08948 3.126 0.00556 **
## Spend 0.36317 0.09712 3.739 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.5 on 19 degrees of freedom
## Multiple R-squared: 0.424, Adjusted R-squared: 0.3936
## F-statistic: 13.98 on 1 and 19 DF, p-value: 0.001389
The output shows the estimated regression coefficients, their standard errors, and the test statistics and p-values for testing their significance. It also reports R-squared, the degrees of freedom, and the F statistic for testing the overall significance of the fitted model. You can use the abline function as shown below to overlay the fitted regression line on the scatterplot.
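The individual pieces of this output can also be extracted programmatically. A small self-contained sketch, rebuilding the TVads data frame from the values printed above:

```r
# Rebuild the TVads data frame from the values printed above
TVads <- data.frame(
  Spend  = c(50.1, 74.1, 19.3, 22.9, 82.4, 40.1, 185.9, 26.9, 20.4, 166.2, 27.0,
             45.6, 154.9, 5.0, 49.7, 26.9, 5.7, 7.6, 9.2, 32.4, 6.1),
  MilImp = c(32.1, 99.6, 11.7, 21.9, 60.8, 78.6, 92.4, 50.7, 21.4, 40.1, 40.8,
             10.4, 88.9, 12.0, 29.2, 38.0, 10.0, 12.3, 23.4, 71.1, 4.4)
)
model <- lm(MilImp ~ Spend, data = TVads)
s <- summary(model)
coef(model)     # estimated coefficients: 22.1627 and 0.3632
s$coefficients  # coefficient table with standard errors, t values, p-values
s$r.squared     # Multiple R-squared: 0.424
s$sigma         # residual standard error: 23.5
```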
plot(TVads$Spend, TVads$MilImp, pch=16, xlab = "TV advertising budget, 1983 ($ millions) ", ylab = "Millions of retained impressions per week", main="Figure 2: Fitted regression model overlaid on the scatterplot of
Millions of retained impressions per week and TV advertising budget", cex.main=0.9) # Create the scatterplot
abline(model, col = "red", lwd= 3) # Draw the fitted line #
The 95% CI for parameter estimates can be obtained using the following function:
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 7.324244 37.0011425
## Spend 0.159899 0.5664492
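As a check, the slope interval can be reproduced by hand from the estimate (0.36317) and standard error (0.09712) in the summary output:

```r
# 95% CI for the slope: estimate +/- t critical value * standard error
est <- 0.36317
se  <- 0.09712
tcrit <- qt(0.975, df = 21 - 2)  # two-sided 95% critical value on 19 df
ci <- c(lower = est - tcrit * se, upper = est + tcrit * se)
ci  # approximately 0.1599 and 0.5664, matching confint()
```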
The 95% confidence interval for the slope of the regression line is (0.1599, 0.5664). To get the predicted value of the response for each value of the predictor, use the predict function as shown below.
predict(model, TVads)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## 40.35771 49.07389 29.17195 30.47938 52.08824 36.72597 89.67675 31.93208 29.57144 82.52222 31.96839 38.72343 78.41836 23.97856 40.21244 31.93208 24.23279 24.92282 25.50389 33.92953 24.37806
The same function can be used to get a prediction interval for each value of the predictor, as shown below.
predict(model, TVads, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 40.35771 -9.989125 90.70455
## 2 49.07389 -1.502881 99.65067
## 3 29.17195 -21.570203 79.91411
## 4 30.47938 -20.176808 81.13557
## 5 52.08824 1.322964 102.85351
## 6 36.72597 -13.664345 87.11629
## 7 89.67675 32.288077 147.06543
## 8 31.93208 -18.640841 82.50499
## 9 29.57144 -21.143338 80.28623
## 10 82.52222 26.944377 138.10007
## 11 31.96839 -18.602607 82.53939
## 12 38.72343 -11.632825 89.07969
## 13 78.41836 23.773745 133.06297
## 14 23.97856 -27.207071 75.16420
## 15 40.21244 -10.134558 90.55945
## 16 31.93208 -18.640841 82.50499
## 17 24.23279 -26.927385 75.39296
## 18 24.92282 -26.170173 76.01581
## 19 25.50389 -25.534718 76.54251
## 20 33.92953 -16.550051 84.40912
## 21 24.37806 -26.767737 75.52385
You may want to compute predicted values of the response for a given set of new values of the predictor. In that case, first create a new data frame with the new values of the predictor, then use the predict function to predict the response as shown below. Here, the new values of the Spend variable are 50 and 100.
NewData <- data.frame(Spend = c(50, 100)) # Creates a new data frame with only new values of predictor variable
predict(model, NewData, interval = "prediction", level = 0.95) # Use the new data frame in the predict function instead of the old data set
## fit lwr upr
## 1 40.3214 -10.025471 90.66826
## 2 58.4801 7.133667 109.82653
The above output shows a predicted 40.32 million retained impressions per week when the TV advertising budget is $50 million. To check the model diagnostics using residual plots, use the code below.
plot(model) # This plots the diagnostic graphs for the fitted regression model. Hit the Enter key 4 times to see the plots shown below #
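If you prefer to see all four diagnostic plots at once instead of pressing Enter between them, you can set up a 2 x 2 plotting grid first; a sketch using R's built-in mtcars data as a stand-in model:

```r
# Arrange the four lm diagnostic plots in a single 2 x 2 grid
fit <- lm(mpg ~ wt, data = mtcars)  # stand-in model on built-in data
par(mfrow = c(2, 2))                # 2 x 2 plotting grid
plot(fit)
par(mfrow = c(1, 1))                # restore the default single-plot layout
```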
When you want to do multiple linear regression in R, add the other predictors to the lm formula, each separated by a plus sign.
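For example, a minimal sketch with two predictors, using R's built-in mtcars data rather than the TVads data:

```r
# Multiple regression: predictors joined by '+' on the right-hand side of the formula
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit2)  # one intercept plus one slope per predictor
```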
# install.packages("tidyverse")
To use R packages you have installed, include this line in your script or in the Console:
library("tidyverse")
# Note here the quotation marks are optional and
# it is the same as
library(tidyverse)
my_df <- read_table2("TVads.txt")
my_df
## # A tibble: 21 x 3
## Firm Spend MilImp
## <chr> <dbl> <dbl>
## 1 MILLER.LITE 50.1 32.1
## 2 PEPSI 74.1 99.6
## 3 STROH'S 19.3 11.7
## 4 FEDERAL.EXPRESS 22.9 21.9
## 5 BURGER.KING 82.4 60.8
## 6 COCO-COLA 40.1 78.6
## 7 MC.DONALD'S 186. 92.4
## 8 MCI 26.9 50.7
## 9 DIET.COLA 20.4 21.4
## 10 FORD 166. 40.1
## # ... with 11 more rows
Of course, the steps for importing data into your statistical software vary a little by software and by data format.
More information on how to import data into R can be found at Import Data
ggplot(my_df, aes(x=MilImp)) + geom_histogram()
ggplot(my_df, aes(x=Spend)) + geom_histogram()
ggplot(my_df, aes(x=Spend, y=MilImp)) + geom_point()
cor(my_df$MilImp, my_df$Spend)
## [1] 0.6511151
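For simple linear regression, the squared correlation coefficient equals the model's R-squared, which gives a quick consistency check against the summary output:

```r
# Squared correlation equals Multiple R-squared for simple linear regression
r <- 0.6511151
round(r^2, 3)  # 0.424, matching the Multiple R-squared in the model summary
```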
In R, run linear regressions with lm (short for linear model):
lm(MilImp ~ Spend, data=my_df)
##
## Call:
## lm(formula = MilImp ~ Spend, data = my_df)
##
## Coefficients:
## (Intercept) Spend
## 22.1627 0.3632
Pipe lm() to summary() for more detailed information:
lm(MilImp ~ Spend, data=my_df) %>%
  summary()
##
## Call:
## lm(formula = MilImp ~ Spend, data = my_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.422 -12.623 -8.171 8.832 50.526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.16269 7.08948 3.126 0.00556 **
## Spend 0.36317 0.09712 3.739 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.5 on 19 degrees of freedom
## Multiple R-squared: 0.424, Adjusted R-squared: 0.3936
## F-statistic: 13.98 on 1 and 19 DF, p-value: 0.001389
For better formatting of the results (pretty print), we can use the texreg package:
## Install and load texreg package
# install.packages("texreg")
library(texreg)
# Pretty print regression results on screen
lm(MilImp ~ Spend, data=my_df) %>%
screenreg
##
## =====================
## Model 1
## ---------------------
## (Intercept) 22.16 **
## (7.09)
## Spend 0.36 **
## (0.10)
## ---------------------
## R^2 0.42
## Adj. R^2 0.39
## Num. obs. 21
## RMSE 23.50
## =====================
## *** p < 0.001, ** p < 0.01, * p < 0.05
# Save regression results to a html file
lm(MilImp ~ Spend, data=my_df) %>%
htmlreg(file="lm_MilImp-Spend.html")
ggplot(my_df, aes(x=Spend, y=MilImp)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
We will use the ggfortify package to generate the diagnostic plots for the regression.
## Install and load ggfortify package
# install.packages("ggfortify")
library(ggfortify)
lm(MilImp ~ Spend, data=my_df) %>%
autoplot()
# Since we need the regression results in many places,
# it would be easier to save it into an R variable
MilImp_lm <- lm(MilImp ~ Spend, data=my_df)
# Save the diagnostic plots as a png file
ggsave("MilImp_lm_diag.png")
A hospital is implementing a program to improve service quality and productivity. As part of this program the hospital management is attempting to measure and evaluate patient satisfaction. Table B.17 contains some of the data that have been collected on a random sample of 25 recently discharged patients.
Variables:
Satisfaction - a subjective measure on an increasing scale
Age - age of patient in years
Severity - an index measuring the severity of the patient's illness
Surgical-Medical - an indicator of whether the patient is a surgical or medical patient (0 = surgical, 1 = medical)
Anxiety - an index measuring the patient's anxiety level
The patient satisfaction is thought to be related to the patient age.
https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals
# Load the package
library(readxl)
library(tidyverse)
library(broom)
# Import Data
table_b17 <- read_xlsx("TableB.17.xlsx")
# Observe the data frame
head(table_b17)
# Visualizing Satisfaction and Age
table_b17 %>% ggplot(aes(Age,Satisfaction))+geom_point()+geom_smooth(method="lm", level=0.95)
The plot shows a decreasing approximately linear pattern.
The fitted model is \(\widehat{Satisfaction}=130.221-1.249\,Age\). Fit it and inspect the summary:
model_satis_age <- lm(Satisfaction ~ Age, data = table_b17)
model_satis_age %>% summary()
According to the model summary, the Multiple R-squared is 0.7205: the fitted model explains 72.05% of the total variation in patient satisfaction.
The standard error of the estimated intercept is 7.7775.
The summary shows a p-value of 1.52e-08 for the slope, so the regression model is significant at the 5% significance level.
The equation for the 95% CI for the estimated slope \(\hat{\beta}_1\) is defined as:
\[\hat{\beta}_1 \pm t_{\alpha / 2, n - 2} \left(\frac{\sqrt{MSE}}{\sqrt{\sum (x_i - \bar{x})^2}}\right)\]
model_satis_age %>% confint(level=0.95)
The fitted \(\hat{\beta}_0\) is 130.2209. The 95% confidence interval for the intercept of the regression line is (114.1318, 146.3100).
Since this confidence interval does not contain 0, it can be concluded that the true intercept differs from zero.
anova(model_satis_age)
The ANOVA table also gives a p-value of 1.52e-08, so the regression model is significant at the 5% significance level.
The confidence interval around the mean response, denoted \(\mu_y\), when the predictor value is \(x_h\) is defined as:
\[\hat{y}_h \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum(x_i - \bar{x})^2} \right)}\]
where \(\hat{y}_h\) is the fitted response for predictor value \(x_h\), \(t_{\alpha/2,n-2}\) is the t-value with \(n-2\) degrees of freedom, and \(\sqrt{MSE \left(\frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum(x_i - \bar{x})^2} \right)}\) is the standard error of the fit.
model_satis_age %>% predict(newdata = data.frame(Age=65), interval = "confidence", level=0.95)
From the output, the fitted satisfaction is 49.0337 when the patient is 65 years old.
The confidence interval of (42.8640, 55.2034) is the range in which the mean satisfaction of 65-year-old patients lies at a 95% level of confidence.
model_satis_age %>% predict(newdata = data.frame(Age=65), interval = "prediction", level=0.95)
The 95% prediction interval for the satisfaction of an individual 65-year-old patient is (26.1101, 71.9572).
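For comparison, the prediction interval for a single new observation at \(x_h\) adds 1 under the square root, reflecting the extra variance of an individual response beyond that of the mean response:

\[\hat{y}_h \pm t_{\alpha / 2, n - 2} \sqrt{MSE \left(1 + \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum(x_i - \bar{x})^2} \right)}\]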
The results show that the prediction interval is wider than the confidence interval because of the additional term in the standard error of prediction. Prediction and confidence intervals are similar in that both concern the response at a given predictor value, but they differ in what is being estimated and interpreted: the confidence interval is for the mean response, while the prediction interval is for a single future observation. (Assuming normality, the best predictor of a random variable is its mean \(\mu\), which for a sample x1, x2, ..., xn is estimated by \(\bar x\).) The prediction interval therefore depends on both the error from the fitted model and the error associated with the future observation.
t0= (-1.2490+1)/0.1471
2*pt(q = t0, df = nrow(table_b17)-2 , lower.tail = TRUE) # Gives the P(T < t0) from the t distribution. Enter the value for df of the t distribution. Use FALSE instead of TRUE to compute P(T >t0). #
The result shows that the p-value is 0.1040, which is greater than 0.05. Therefore, we fail to reject H0, and the true slope may well equal -1.
# Load the package
library(readxl)
library(tidyverse)
library(broom)
library(olsrr)
library(car)
# Import Data
table_b4 <- read_table2("TableB4.txt")
# Observe the data frame
head(table_b4)
## # A tibble: 6 x 10
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int>
## 1 25.9 4.92 1 3.47 0.998 1 7 4 42 0
## 2 29.5 5.02 1 3.53 1.5 2 7 4 62 0
## 3 27.9 4.54 1 2.28 1.18 1 6 3 40 0
## 4 25.9 4.56 1 4.05 1.23 1 6 3 54 0
## 5 29.9 5.06 1 4.46 1.12 1 6 3 42 0
## 6 29.9 3.89 1 4.46 0.988 1 6 3 56 0
summary(table_b4)
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## Min. :25.90 Min. :3.891 Min. :1.000 Min. :2.275 Min. :0.975 Min. :0.000 Min. :5.0 Min. :2.000 Min. : 3.00 Min. :0.00
## 1st Qu.:29.90 1st Qu.:5.057 1st Qu.:1.000 1st Qu.:4.855 1st Qu.:1.161 1st Qu.:1.000 1st Qu.:6.0 1st Qu.:3.000 1st Qu.:30.00 1st Qu.:0.00
## Median :33.70 Median :5.974 Median :1.000 Median :5.685 Median :1.432 Median :1.000 Median :6.0 Median :3.000 Median :40.00 Median :0.00
## Mean :34.61 Mean :6.405 Mean :1.167 Mean :6.033 Mean :1.384 Mean :1.312 Mean :6.5 Mean :3.167 Mean :37.46 Mean :0.25
## 3rd Qu.:38.15 3rd Qu.:7.873 3rd Qu.:1.500 3rd Qu.:7.158 3rd Qu.:1.577 3rd Qu.:2.000 3rd Qu.:7.0 3rd Qu.:3.250 3rd Qu.:48.50 3rd Qu.:0.25
## Max. :45.80 Max. :9.142 Max. :1.500 Max. :9.890 Max. :1.831 Max. :2.000 Max. :8.0 Max. :4.000 Max. :62.00 Max. :1.00
# Scatterplot matrix of all the variables
plot(table_b4,pch=16)
# matrix of correlation coefficients
cor(table_b4)
## y x1 x2 x3 x4 x5 x6 x7 x8 x9
## y 1.0000000 0.8759762 0.7128360 0.6493554 0.7101541 0.45885885 0.5308653 0.2827929 -0.39851772 0.2623629
## x1 0.8759762 1.0000000 0.6512433 0.6892118 0.7342610 0.45857369 0.6405621 0.3670500 -0.43711531 0.1466709
## x2 0.7128360 0.6512433 1.0000000 0.4129438 0.7285916 0.22402204 0.5103104 0.4264014 -0.10074847 0.2041241
## x3 0.6493554 0.6892118 0.4129438 1.0000000 0.5715551 0.20466834 0.3921362 0.1516178 -0.35276370 0.3059960
## x4 0.7101541 0.7342610 0.7285916 0.5715551 1.0000000 0.35888351 0.6788606 0.5743353 -0.13908686 0.1065612
## x5 0.4588588 0.4585737 0.2240220 0.2046683 0.3588835 1.00000000 0.5893871 0.5412988 -0.02016883 0.1016185
## x6 0.5308653 0.6405621 0.5103104 0.3921362 0.6788606 0.58938707 1.0000000 0.8703883 0.12426629 0.2222222
## x7 0.2827929 0.3670500 0.4264014 0.1516178 0.5743353 0.54129880 0.8703883 1.0000000 0.31351144 0.0000000
## x8 -0.3985177 -0.4371153 -0.1007485 -0.3527637 -0.1390869 -0.02016883 0.1242663 0.3135114 1.00000000 0.2257796
## x9 0.2623629 0.1466709 0.2041241 0.3059960 0.1065612 0.10161846 0.2222222 0.0000000 0.22577960 1.0000000
table_b4 %>% corrr::correlate() %>% corrr::fashion()
## rowname y x1 x2 x3 x4 x5 x6 x7 x8 x9
## 1 y .88 .71 .65 .71 .46 .53 .28 -.40 .26
## 2 x1 .88 .65 .69 .73 .46 .64 .37 -.44 .15
## 3 x2 .71 .65 .41 .73 .22 .51 .43 -.10 .20
## 4 x3 .65 .69 .41 .57 .20 .39 .15 -.35 .31
## 5 x4 .71 .73 .73 .57 .36 .68 .57 -.14 .11
## 6 x5 .46 .46 .22 .20 .36 .59 .54 -.02 .10
## 7 x6 .53 .64 .51 .39 .68 .59 .87 .12 .22
## 8 x7 .28 .37 .43 .15 .57 .54 .87 .31 .00
## 9 x8 -.40 -.44 -.10 -.35 -.14 -.02 .12 .31 .23
## 10 x9 .26 .15 .20 .31 .11 .10 .22 .00 .23
table_b4 %>% corrr::correlate() %>% corrr::rplot()
# build the model
model_b4_full <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data=table_b4)
model_b4_full%>% summary()
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
## data = table_b4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.720 -1.956 -0.045 1.627 4.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.92765 5.91285 2.525 0.0243 *
## x1 1.92472 1.02990 1.869 0.0827 .
## x2 7.00053 4.30037 1.628 0.1258
## x3 0.14918 0.49039 0.304 0.7654
## x4 2.72281 4.35955 0.625 0.5423
## x5 2.00668 1.37351 1.461 0.1661
## x6 -0.41012 2.37854 -0.172 0.8656
## x7 -1.40324 3.39554 -0.413 0.6857
## x8 -0.03715 0.06672 -0.557 0.5865
## x9 1.55945 1.93750 0.805 0.4343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.949 on 14 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.7587
## F-statistic: 9.037 on 9 and 14 DF, p-value: 0.000185
anova(model_b4_full)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 636.16 636.16 73.1525 6.238e-07 ***
## x2 1 29.18 29.18 3.3551 0.08836 .
## x3 1 4.71 4.71 0.5416 0.47391
## x4 1 0.03 0.03 0.0032 0.95537
## x5 1 8.78 8.78 1.0091 0.33216
## x6 1 13.03 13.03 1.4982 0.24115
## x7 1 9.14 9.14 1.0515 0.32254
## x8 1 0.64 0.64 0.0741 0.78943
## x9 1 5.63 5.63 0.6478 0.43435
## Residuals 14 121.75 8.70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vcov(model_b4_full)
## (Intercept) x1 x2 x3 x4 x5 x6 x7 x8 x9
## (Intercept) 34.9618142 1.4185704 -8.05381699 -0.578910413 1.544918516 0.590245661 -9.210285660 9.29003130 -0.122000743 5.23739149
## x1 1.4185704 1.0606966 -1.95694395 -0.188439886 -0.796712965 -0.449355607 -1.401396213 1.56627449 0.024816000 0.47756201
## x2 -8.0538170 -1.9569440 18.49319704 0.498255845 -7.835636978 1.385215724 2.985701832 -4.19172566 -0.011989093 -2.61735808
## x3 -0.5789104 -0.1884399 0.49825584 0.240479765 -0.613877784 0.105719049 0.097666359 -0.08053358 0.004479017 -0.33268222
## x4 1.5449185 -0.7967130 -7.83563698 -0.613877784 19.005706515 0.473391133 -0.488584336 -2.34434350 -0.004688778 1.04295974
## x5 0.5902457 -0.4493556 1.38521572 0.105719049 0.473391133 1.886525901 0.147174179 -1.33029505 0.007405142 -0.47388005
## x6 -9.2102857 -1.4013962 2.98570183 0.097666359 -0.488584336 0.147174179 5.657473253 -6.79090191 -0.003215709 -2.20865227
## x7 9.2900313 1.5662745 -4.19172566 -0.080533584 -2.344343503 -1.330295048 -6.790901910 11.52970487 -0.058993674 3.53532750
## x8 -0.1220007 0.0248160 -0.01198909 0.004479017 -0.004688778 0.007405142 -0.003215709 -0.05899367 0.004451546 -0.04896325
## x9 5.2373915 0.4775620 -2.61735808 -0.332682217 1.042959743 -0.473880052 -2.208652269 3.53532750 -0.048963249 3.75389020
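The diagonal of this variance-covariance matrix contains the squared standard errors of the coefficients, so its square roots should match the Std. Error column of the summary; checking with the printed values:

```r
# Coefficient standard errors are the square roots of the diagonal of vcov()
sqrt(34.9618142)  # approx. 5.91285, the Std. Error of the intercept
sqrt(1.0606966)   # approx. 1.02990, the Std. Error of x1
```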
vif(model_b4_full)
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## 7.021036 2.835413 2.454907 3.836477 1.823605 11.710101 9.722663 2.320887 1.942494
ols_vif_tol(model_b4_full)
## # A tibble: 9 x 3
## Variables Tolerance VIF
## <chr> <dbl> <dbl>
## 1 x1 0.142 7.02
## 2 x2 0.353 2.84
## 3 x3 0.407 2.45
## 4 x4 0.261 3.84
## 5 x5 0.548 1.82
## 6 x6 0.0854 11.7
## 7 x7 0.103 9.72
## 8 x8 0.431 2.32
## 9 x9 0.515 1.94
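Tolerance and VIF are reciprocals of each other (with \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing \(x_j\) on the other predictors), as a quick check against the table above confirms:

```r
# Tolerance is the reciprocal of VIF
vif_x6 <- 11.710101  # VIF for x6 from the output above
1 / vif_x6           # approx. 0.0854, the tolerance reported for x6
```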
alias(lm(y ~ as.factor(x2) + as.factor(x5) + as.factor(x6) + as.factor(x7) + as.factor(x9), data=table_b4))
## Model :
## y ~ as.factor(x2) + as.factor(x5) + as.factor(x6) + as.factor(x7) +
## as.factor(x9)
##
## Complete :
## (Intercept) as.factor(x2)1.5 as.factor(x5)1 as.factor(x5)1.5 as.factor(x5)2 as.factor(x6)6 as.factor(x6)7 as.factor(x6)8 as.factor(x7)3 as.factor(x9)1
## as.factor(x7)4 0 0 0 0 0 1 1 1 -1 0
plot(model_b4_full,pch=16,col="blue")
avPlots(model_b4_full)
ols_plot_added_variable(model_b4_full)
yhat_b4_full <- model_b4_full$fit # Save predicted y to an object #
t_b4_full <- rstudent(model_b4_full) # Save studentized residuals to an object #
# Another way to check non-normality of studentized residuals #
hist(t_b4_full)
ols_plot_resid_hist(model_b4_full)
qqnorm(t_b4_full, pch=15)
ols_plot_resid_qq(model_b4_full)
shapiro.test(t_b4_full) # If the p-value is larger than 0.05, there is no evidence of non-normality #
##
## Shapiro-Wilk normality test
##
## data: t_b4_full
## W = 0.96485, p-value = 0.5432
plot(yhat_b4_full,t_b4_full, pch=16) # Studentized residuals versus predicted y #
plot(table_b4$x1,t_b4_full, pch=16) # Studentized residuals versus x1 predictor #
plot(table_b4$x2,t_b4_full, pch=16) # Studentized residuals versus x2 predictor #
plot(table_b4$x3,t_b4_full, pch=16) # Studentized residuals versus x3 predictor #
plot(table_b4$x4,t_b4_full, pch=16) # Studentized residuals versus x4 predictor #
plot(table_b4$x5,t_b4_full, pch=16) # Studentized residuals versus x5 predictor #
plot(table_b4$x6,t_b4_full, pch=16) # Studentized residuals versus x6 predictor #
plot(table_b4$x7,t_b4_full, pch=16) # Studentized residuals versus x7 predictor #
plot(table_b4$x8,t_b4_full, pch=16) # Studentized residuals versus x8 predictor #
plot(table_b4$x9,t_b4_full, pch=16) # Studentized residuals versus x9 predictor #
table_b4 %>% mutate(student_residual=t_b4_full) %>% plot()
ols_plot_diagnostics(model_b4_full)
ols_plot_dfbetas(model_b4_full)
# Lack of Fit F Test
ols_pure_error_anova(lm(y~x7, data = table_b4))
## Lack of Fit F Test
## ---------------
## Response : y
## Predictor: x7
##
## Analysis of Variance Table
## ---------------------------------------------------------------------
## DF Sum Sq Mean Sq F Value Pr(>F)
## ---------------------------------------------------------------------
## x7 1 66.30034 66.30034 1.828601 0.190031
## Residual 22 762.7459 34.67027
## Lack of fit 1 1.340076 1.340076 0.03696004 0.8493934
## Pure Error 21 761.4058 36.25742
## ---------------------------------------------------------------------
MPV::PRESS(model_b4_full)
## [1] 393.492
ols_press(model_b4_full)
## [1] 393.492
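PRESS can also be computed directly from the ordinary residuals \(e_i\) and leverages \(h_{ii}\) as \(\sum (e_i/(1-h_{ii}))^2\), without refitting the model n times; a sketch using the built-in mtcars data as a stand-in:

```r
# PRESS = sum of squared leave-one-out residuals, from residuals and leverages
fit <- lm(mpg ~ wt, data = mtcars)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press                  # PRESS statistic
sum(residuals(fit)^2)  # ordinary RSS, always smaller than PRESS
```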
library(MASS)
bc <- boxcox(model_b4_full)
# Lambda that maximizes the log-likelihood #
lambda <- bc$x[which(bc$y==max(bc$y))]
lambda
## [1] -0.8282828
# another package forecast for boxcox
# library(forecast)
# lambda <- BoxCox.lambda(model_b4_full)
# BoxCox(model_b4_full, lambda="auto")
# InvBoxCox(model_b4_full, lambda="auto", biasadj = FALSE, fvar = NULL)
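The transform itself, and how the estimated \(\lambda \approx -0.83\) might be applied to the response, can be sketched as follows (the refit line in the comment is illustrative only):

```r
# Box-Cox power transform: (y^lambda - 1) / lambda, with log(y) in the limit lambda = 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
box_cox(c(1, 2, 4), 1)  # lambda = 1: values shifted down by 1, i.e. 0 1 3
box_cox(exp(1), 0)      # reduces to log(y): 1
# With the estimated lambda, the response could then be refit, e.g.:
# lm(box_cox(y, -0.83) ~ x1 + x2, data = table_b4)
```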
# Log likelihood of the fitted model
logLik(model_b4_full)
## 'log Lik.' -53.54133 (df=11)
model_b4_none <- update(model_b4_full, .~1)
logLik(model_b4_none)
## 'log Lik.' -76.56119 (df=2)
# pseudo R2
1 - logLik(model_b4_full)/logLik(model_b4_none)
## 'log Lik.' 0.3006726 (df=11)
# Interpretation of coefficients via odds ratios
# (Note: exp(coef) is an odds ratio only for logistic regression models;
# for this linear model the values below are illustrative only.)
odds <- exp(coef(model_b4_full))
# Probability implied by the odds: odds / (1 + odds)
odds/(1 + odds)
## (Intercept) x1 x2 x3 x4 x5 x6 x7 x8 x9
## 0.9999997 0.8726640 0.9990894 0.5372255 0.9383591 0.8814971 0.3988824 0.1973032 0.4907138 0.8262739
# Forward Selection Regression #
model_b4_none <- lm (y ~ 1, data = table_b4)
step(model_b4_none, direction= "forward", scope=(~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9), k=2)
## Start: AIC=87.01
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x1 1 636.16 192.89 54.018
## + x2 1 421.27 407.78 71.984
## + x4 1 418.10 410.94 72.170
## + x3 1 349.58 479.47 75.871
## + x6 1 233.64 595.41 81.069
## + x5 1 174.56 654.49 83.339
## + x8 1 131.67 697.38 84.863
## + x7 1 66.30 762.75 87.013
## <none> 829.05 87.013
## + x9 1 57.07 771.98 87.302
##
## Step: AIC=54.02
## y ~ x1
##
## Df Sum of Sq RSS AIC
## + x2 1 29.1766 163.71 52.082
## <none> 192.89 54.018
## + x9 1 15.1870 177.70 54.050
## + x4 1 8.0654 184.82 54.993
## + x5 1 3.4299 189.46 55.587
## + x3 1 3.2869 189.60 55.605
## + x7 1 1.4375 191.45 55.838
## + x6 1 1.2867 191.60 55.857
## + x8 1 0.2499 192.64 55.987
##
## Step: AIC=52.08
## y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 163.71 52.082
## + x9 1 9.9142 153.80 52.582
## + x7 1 7.4562 156.26 52.963
## + x5 1 6.0754 157.64 53.174
## + x3 1 4.7101 159.00 53.381
## + x8 1 4.1231 159.59 53.469
## + x6 1 4.0956 159.62 53.474
## + x4 1 0.0602 163.65 54.073
##
## Call:
## lm(formula = y ~ x1 + x2, data = table_b4)
##
## Coefficients:
## (Intercept) x1 x2
## 10.042 2.713 6.164
ols_step_forward_aic(model_b4_full)
## Forward Selection Method
## ------------------------
##
## Candidate Terms:
##
## 1 . x1
## 2 . x2
## 3 . x3
## 4 . x4
## 5 . x5
## 6 . x6
## 7 . x7
## 8 . x8
## 9 . x9
##
##
## Variables Entered:
##
## - x1
## - x2
##
## No more variables to be added.
##
## Selection Summary
## ------------------------------------------------------------------
## Variable AIC Sum Sq RSS R-Sq Adj. R-Sq
## ------------------------------------------------------------------
## x1 124.127 636.156 192.891 0.76733 0.75676
## x2 122.191 665.332 163.714 0.80253 0.78372
## ------------------------------------------------------------------
ols_step_forward_p(model_b4_full)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. x1
## 2. x2
## 3. x3
## 4. x4
## 5. x5
## 6. x6
## 7. x7
## 8. x8
## 9. x9
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - x1
## - x2
## - x9
## - x8
## - x5
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------
## R 0.916 RMSE 2.725
## R-Squared 0.839 Coef. Var 7.872
## Adj. R-Squared 0.794 MSE 7.425
## Pred R-Squared 0.671 MAE 1.998
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 695.403 5 139.081 18.732 0.0000
## Residual 133.643 18 7.425
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 13.737 3.939 3.488 0.003 5.463 22.012
## x1 2.017 0.631 0.531 3.198 0.005 0.692 3.342
## x2 7.224 3.290 0.290 2.196 0.041 0.313 14.136
## x9 2.034 1.377 0.150 1.478 0.157 -0.858 4.927
## x8 -0.072 0.051 -0.168 -1.417 0.174 -0.179 0.035
## x5 1.307 1.105 0.132 1.184 0.252 -1.013 3.628
## ----------------------------------------------------------------------------------------
##
## Selection Summary
## ------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------
## 1 x1 0.7673 0.7568 2.1808 124.1267 2.9610
## 2 x2 0.8025 0.7837 0.8257 122.1907 2.7921
## 3 x9 0.8145 0.7867 1.6857 122.6914 2.7731
## 4 x8 0.8263 0.7897 2.5639 123.1187 2.7534
## 5 x5 0.8388 0.7940 3.3678 123.3200 2.7248
## ------------------------------------------------------------------------
# k = 2 (the default) gives AIC; k = log(n) gives BIC, where n is the number of observations #
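A sketch of using the BIC penalty instead of AIC, with R's built-in mtcars data standing in for the table data:

```r
# Stepwise elimination with the BIC penalty k = log(n)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit_bic <- step(fit, direction = "backward", k = log(nrow(mtcars)), trace = 0)
coef(fit_bic)  # coefficients of the BIC-selected model
```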
# Backward Elimination Regression #
step(model_b4_full, direction = "backward")
## Start: AIC=58.97
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9
##
## Df Sum of Sq RSS AIC
## - x6 1 0.2585 122.01 57.025
## - x3 1 0.8048 122.55 57.132
## - x7 1 1.4852 123.23 57.265
## - x8 1 2.6960 124.44 57.499
## - x4 1 3.3922 125.14 57.633
## - x9 1 5.6337 127.38 58.059
## <none> 121.75 58.974
## - x5 1 18.5622 140.31 60.379
## - x2 1 23.0454 144.79 61.134
## - x1 1 30.3724 152.12 62.319
##
## Step: AIC=57.02
## y ~ x1 + x2 + x3 + x4 + x5 + x7 + x8 + x9
##
## Df Sum of Sq RSS AIC
## - x3 1 0.889 122.90 55.199
## - x8 1 2.731 124.74 55.556
## - x4 1 3.312 125.32 55.667
## - x9 1 5.889 127.90 56.156
## - x7 1 9.249 131.26 56.778
## <none> 122.01 57.025
## - x5 1 18.798 140.81 58.464
## - x2 1 26.774 148.78 59.786
## - x1 1 40.508 162.51 61.905
##
## Step: AIC=55.2
## y ~ x1 + x2 + x4 + x5 + x7 + x8 + x9
##
## Df Sum of Sq RSS AIC
## - x8 1 3.245 126.14 53.824
## - x4 1 4.744 127.64 54.108
## - x9 1 8.718 131.61 54.844
## - x7 1 9.501 132.40 54.986
## <none> 122.90 55.199
## - x5 1 17.987 140.88 56.477
## - x2 1 25.930 148.83 57.793
## - x1 1 53.969 176.87 61.936
##
## Step: AIC=53.82
## y ~ x1 + x2 + x4 + x5 + x7 + x9
##
## Df Sum of Sq RSS AIC
## - x4 1 4.935 131.07 52.745
## - x9 1 5.839 131.98 52.910
## <none> 126.14 53.824
## - x5 1 19.015 145.16 55.194
## - x7 1 22.338 148.48 55.737
## - x2 1 24.770 150.91 56.127
## - x1 1 95.846 221.99 65.390
##
## Step: AIC=52.75
## y ~ x1 + x2 + x5 + x7 + x9
##
## Df Sum of Sq RSS AIC
## - x9 1 5.540 136.62 51.739
## <none> 131.07 52.745
## - x5 1 16.852 147.93 53.648
## - x7 1 17.471 148.55 53.748
## - x2 1 39.893 170.97 57.122
## - x1 1 158.792 289.87 69.793
##
## Step: AIC=51.74
## y ~ x1 + x2 + x5 + x7
##
## Df Sum of Sq RSS AIC
## <none> 136.62 51.739
## - x5 1 19.643 156.26 52.963
## - x7 1 21.024 157.64 53.174
## - x2 1 47.648 184.26 56.919
## - x1 1 157.504 294.12 68.142
##
## Call:
## lm(formula = y ~ x1 + x2 + x5 + x7, data = table_b4)
##
## Coefficients:
## (Intercept) x1 x2 x5 x7
## 13.509 2.419 8.480 2.001 -2.182
ols_step_backward_aic(model_b4_full)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . x1
## 2 . x2
## 3 . x3
## 4 . x4
## 5 . x5
## 6 . x6
## 7 . x7
## 8 . x8
## 9 . x9
##
##
## Variables Removed:
##
## - x6
## - x3
## - x8
## - x4
## - x9
##
## No more variables to be removed.
##
##
## Backward Elimination Summary
## -------------------------------------------------------------------
## Variable AIC RSS Sum Sq R-Sq Adj. R-Sq
## -------------------------------------------------------------------
## Full Model 129.083 121.748 707.298 0.85315 0.75874
## x6 127.134 122.007 707.040 0.85283 0.77435
## x3 125.308 122.896 706.150 0.85176 0.78691
## x8 123.933 126.141 702.906 0.84785 0.79415
## x4 122.854 131.075 697.971 0.84190 0.79798
## x9 121.848 136.615 692.431 0.83521 0.80052
## -------------------------------------------------------------------
ols_step_backward_p(model_b4_full)
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1 . x1
## 2 . x2
## 3 . x3
## 4 . x4
## 5 . x5
## 6 . x6
## 7 . x7
## 8 . x8
## 9 . x9
##
## We are eliminating variables based on p value...
##
## Variables Removed:
##
## - x6
## - x3
## - x8
## - x4
## - x9
##
## No more variables satisfy the condition of p value = 0.3
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------
## R 0.914 RMSE 2.681
## R-Squared 0.835 Coef. Var 7.747
## Adj. R-Squared 0.801 MSE 7.190
## Pred R-Squared 0.718 MAE 1.974
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 692.431 4 173.108 24.075 0.0000
## Residual 136.615 19 7.190
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 13.509 3.633 3.718 0.001 5.905 21.113
## x1 2.419 0.517 0.637 4.680 0.000 1.337 3.501
## x2 8.480 3.294 0.340 2.574 0.019 1.585 15.375
## x5 2.001 1.210 0.201 1.653 0.115 -0.533 4.534
## x7 -2.182 1.276 -0.205 -1.710 0.104 -4.853 0.489
## ----------------------------------------------------------------------------------------
##
##
## Elimination Summary
## ------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------
## 1 x6 0.8528 0.7743 8.0297 127.1336 2.8520
## 2 x3 0.8518 0.7869 6.1320 125.3079 2.7715
## 3 x8 0.8478 0.7941 4.5051 123.9333 2.7240
## 4 x4 0.8419 0.798 3.0725 122.8543 2.6985
## 5 x9 0.8352 0.8005 1.7095 121.8477 2.6815
## ------------------------------------------------------------------------
# Subset method #
library(leaps) # Load the package #
model_b4_subset <- regsubsets(y ~ x1 + x2 + x3 +x4 + x5 + x6 + x7 + x8 + x9, data=table_b4, nbest=10 ) # nbest is the number of models from each size #
summary(model_b4_subset) # Hard to read output from this #
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 +
## x9, data = table_b4, nbest = 10)
## 9 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## x5 FALSE FALSE
## x6 FALSE FALSE
## x7 FALSE FALSE
## x8 FALSE FALSE
## x9 FALSE FALSE
## 10 subsets of each size up to 8
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6 x7 x8 x9
## 1 ( 1 ) "*" " " " " " " " " " " " " " " " "
## 1 ( 2 ) " " "*" " " " " " " " " " " " " " "
## 1 ( 3 ) " " " " " " "*" " " " " " " " " " "
## 1 ( 4 ) " " " " "*" " " " " " " " " " " " "
## 1 ( 5 ) " " " " " " " " " " "*" " " " " " "
## 1 ( 6 ) " " " " " " " " "*" " " " " " " " "
## 1 ( 7 ) " " " " " " " " " " " " " " "*" " "
## 1 ( 8 ) " " " " " " " " " " " " "*" " " " "
## 1 ( 9 ) " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 2 ( 2 ) "*" " " " " " " " " " " " " " " "*"
## 2 ( 3 ) "*" " " " " "*" " " " " " " " " " "
## 2 ( 4 ) "*" " " " " " " "*" " " " " " " " "
## 2 ( 5 ) "*" " " "*" " " " " " " " " " " " "
## 2 ( 6 ) "*" " " " " " " " " " " "*" " " " "
## 2 ( 7 ) "*" " " " " " " " " "*" " " " " " "
## 2 ( 8 ) "*" " " " " " " " " " " " " "*" " "
## 2 ( 9 ) " " "*" "*" " " " " " " " " " " " "
## 2 ( 10 ) " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) "*" "*" " " " " " " " " " " " " "*"
## 3 ( 2 ) "*" "*" " " " " " " " " "*" " " " "
## 3 ( 3 ) "*" "*" " " " " "*" " " " " " " " "
## 3 ( 4 ) "*" "*" "*" " " " " " " " " " " " "
## 3 ( 5 ) "*" "*" " " " " " " " " " " "*" " "
## 3 ( 6 ) "*" "*" " " " " " " "*" " " " " " "
## 3 ( 7 ) "*" "*" " " "*" " " " " " " " " " "
## 3 ( 8 ) "*" " " " " "*" " " " " " " " " "*"
## 3 ( 9 ) "*" " " " " " " " " " " " " "*" "*"
## 3 ( 10 ) "*" " " " " " " " " "*" " " " " "*"
## 4 ( 1 ) "*" "*" " " " " "*" " " "*" " " " "
## 4 ( 2 ) "*" "*" " " " " " " " " " " "*" "*"
## 4 ( 3 ) "*" "*" " " " " "*" "*" " " " " " "
## 4 ( 4 ) "*" "*" " " " " " " "*" " " " " "*"
## 4 ( 5 ) "*" "*" " " " " " " " " "*" " " "*"
## 4 ( 6 ) "*" "*" " " " " "*" " " " " " " "*"
## 4 ( 7 ) "*" "*" " " " " "*" " " " " "*" " "
## 4 ( 8 ) "*" "*" "*" " " "*" " " " " " " " "
## 4 ( 9 ) "*" "*" "*" " " " " " " " " " " "*"
## 4 ( 10 ) "*" "*" "*" " " " " " " "*" " " " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " "*" " " " "
## 5 ( 2 ) "*" "*" " " " " "*" " " "*" " " "*"
## 5 ( 3 ) "*" "*" " " "*" "*" " " "*" " " " "
## 5 ( 4 ) "*" "*" " " " " "*" "*" " " " " "*"
## 5 ( 5 ) "*" "*" " " " " "*" " " " " "*" "*"
## 5 ( 6 ) "*" "*" " " " " "*" "*" "*" " " " "
## 5 ( 7 ) "*" "*" " " " " "*" " " "*" "*" " "
## 5 ( 8 ) "*" "*" "*" " " "*" "*" " " " " " "
## 5 ( 9 ) "*" "*" " " "*" " " " " " " "*" "*"
## 5 ( 10 ) "*" "*" "*" " " "*" " " " " "*" " "
## 6 ( 1 ) "*" "*" " " "*" "*" " " "*" " " "*"
## 6 ( 2 ) "*" "*" " " " " "*" " " "*" "*" "*"
## 6 ( 3 ) "*" "*" "*" " " "*" " " "*" " " "*"
## 6 ( 4 ) "*" "*" " " " " "*" "*" " " "*" "*"
## 6 ( 5 ) "*" "*" "*" "*" "*" " " "*" " " " "
## 6 ( 6 ) "*" "*" " " "*" "*" "*" " " " " "*"
## 6 ( 7 ) "*" "*" "*" " " "*" "*" " " " " "*"
## 6 ( 8 ) "*" "*" "*" " " "*" " " "*" "*" " "
## 6 ( 9 ) "*" "*" "*" " " "*" "*" "*" " " " "
## 6 ( 10 ) "*" "*" " " " " "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*"
## 7 ( 2 ) "*" "*" " " "*" "*" "*" " " "*" "*"
## 7 ( 3 ) "*" "*" "*" "*" "*" " " "*" " " "*"
## 7 ( 4 ) "*" "*" "*" " " "*" " " "*" "*" "*"
## 7 ( 5 ) "*" "*" " " "*" "*" "*" "*" " " "*"
## 7 ( 6 ) "*" "*" "*" " " "*" "*" " " "*" "*"
## 7 ( 7 ) "*" "*" "*" "*" "*" "*" " " " " "*"
## 7 ( 8 ) "*" "*" " " " " "*" "*" "*" "*" "*"
## 7 ( 9 ) "*" "*" "*" " " "*" "*" "*" " " "*"
## 7 ( 10 ) "*" "*" "*" "*" "*" " " "*" "*" " "
## 8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*"
## 8 ( 2 ) "*" "*" " " "*" "*" "*" "*" "*" "*"
## 8 ( 3 ) "*" "*" "*" "*" "*" "*" " " "*" "*"
## 8 ( 4 ) "*" "*" "*" "*" "*" "*" "*" " " "*"
## 8 ( 5 ) "*" "*" "*" " " "*" "*" "*" "*" "*"
## 8 ( 6 ) "*" "*" "*" "*" "*" "*" "*" "*" " "
## 8 ( 7 ) "*" "*" "*" "*" " " "*" "*" "*" "*"
## 8 ( 8 ) "*" " " "*" "*" "*" "*" "*" "*" "*"
## 8 ( 9 ) " " "*" "*" "*" "*" "*" "*" "*" "*"
## Plot adjusted R-squared for each model ##
plot(model_b4_subset, scale='adjr2')
## can use Cp, r2 or bic for scale ##
plot(model_b4_subset, scale='bic')
plot(model_b4_subset, scale='Cp')
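The grid printed by summary() above is hard to scan by eye. The summary object also stores each criterion as a numeric vector (one entry per candidate model), so the best model can be picked programmatically; a minimal sketch:

```r
# summary() on a regsubsets object stores one entry per candidate model
# in the vectors adjr2, cp and bic, plus a logical membership matrix 'which'.
s <- summary(model_b4_subset)
best <- which.max(s$adjr2)       # model with the largest adjusted R-squared
s$which[best, ]                  # TRUE for each predictor included in it
s$adjr2[best]
which.min(s$cp)                  # same idea for Cp and BIC (smaller is better)
which.min(s$bic)
```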
k <- ols_step_all_possible(model_b4_full)
k
## # A tibble: 511 x 6
## Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
## * <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 1 x1 0.767 0.757 2.18
## 2 2 1 x2 0.508 0.486 26.9
## 3 3 1 x4 0.504 0.482 27.3
## 4 4 1 x3 0.422 0.395 35.1
## 5 5 1 x6 0.282 0.249 48.5
## 6 6 1 x5 0.211 0.175 55.3
## 7 7 1 x8 0.159 0.121 60.2
## 8 8 1 x7 0.0800 0.0382 67.7
## 9 9 1 x9 0.0688 0.0265 68.8
## 10 10 2 x1 x2 0.803 0.784 0.826
## # ... with 501 more rows
plot(k)
k <- ols_step_best_subset(model_b4_full)
k
## Best Subsets Regression
## -----------------------------------------
## Model Index Predictors
## -----------------------------------------
## 1 x1
## 2 x1 x2
## 3 x1 x2 x9
## 4 x1 x2 x5 x7
## 5 x1 x2 x3 x5 x7
## 6 x1 x2 x4 x5 x7 x9
## 7 x1 x2 x4 x5 x7 x8 x9
## 8 x1 x2 x3 x4 x5 x7 x8 x9
## 9 x1 x2 x3 x4 x5 x6 x7 x8 x9
## -----------------------------------------
##
## Subsets Regression Summary
## ---------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## ---------------------------------------------------------------------------------------------------------------------------------
## 1 0.7673 0.7568 0.7291 2.1808 124.1267 56.3323 127.6609 9.5680 9.4984 0.4175 0.2750
## 2 0.8025 0.7837 0.7293 0.8257 122.1907 55.5796 126.9029 8.9328 8.7704 0.3898 0.2539
## 3 0.8145 0.7867 0.707 1.6857 122.6914 57.1836 128.5817 9.2752 8.9717 0.4047 0.2597
## 4 0.8352 0.8005 0.7181 1.7095 121.8477 58.4590 128.9161 9.1543 8.6882 0.3995 0.2515
## 5 0.8422 0.7984 0.6844 3.0405 122.8033 61.1328 131.0496 9.7955 9.0831 0.4274 0.2629
## 6 0.8478 0.7941 0.6992 4.5051 123.9333 64.1315 133.3577 10.6276 9.5842 0.4638 0.2775
## 7 0.8518 0.7869 0.6527 6.1320 125.3079 67.3960 135.9103 11.7349 10.2413 0.5121 0.2965
## 8 0.8528 0.7743 0.6037 8.0297 127.1336 70.8062 138.9141 13.3142 11.1839 0.5810 0.3238
## 9 0.8531 0.7587 0.5254 10.0000 129.0827 74.2389 142.0413 15.3300 12.3198 0.6689 0.3566
## ---------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
plot(k)
k <- ols_step_both_p(model_b4_full, details = TRUE)
## Stepwise Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. x1
## 2. x2
## 3. x3
## 4. x4
## 5. x5
## 6. x6
## 7. x7
## 8. x8
## 9. x9
##
## We are selecting variables based on p value...
##
##
## Stepwise Selection: Step 1
##
## - x1 added
##
## Model Summary
## -------------------------------------------------------------
## R 0.876 RMSE 2.961
## R-Squared 0.767 Coef. Var 8.555
## Adj. R-Squared 0.757 MSE 8.768
## Pred R-Squared 0.729 MAE 2.352
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 636.156 1 636.156 72.556 0.0000
## Residual 192.891 22 8.768
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------
## (Intercept) 13.320 2.572 5.179 0.000 7.987 18.654
## x1 3.324 0.390 0.876 8.518 0.000 2.515 4.134
## --------------------------------------------------------------------------------------
##
##
##
## Stepwise Selection: Step 2
##
## - x2 added
##
## Model Summary
## -------------------------------------------------------------
## R 0.896 RMSE 2.792
## R-Squared 0.803 Coef. Var 8.067
## Adj. R-Squared 0.784 MSE 7.796
## Pred R-Squared 0.729 MAE 2.160
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 665.332 2 332.666 42.672 0.0000
## Residual 163.714 21 7.796
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 10.042 2.958 3.394 0.003 3.889 16.194
## x1 2.713 0.485 0.715 5.595 0.000 1.705 3.722
## x2 6.164 3.186 0.247 1.935 0.067 -0.462 12.791
## ---------------------------------------------------------------------------------------
##
##
##
## Model Summary
## -------------------------------------------------------------
## R 0.896 RMSE 2.792
## R-Squared 0.803 Coef. Var 8.067
## Adj. R-Squared 0.784 MSE 7.796
## Pred R-Squared 0.729 MAE 2.160
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 665.332 2 332.666 42.672 0.0000
## Residual 163.714 21 7.796
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 10.042 2.958 3.394 0.003 3.889 16.194
## x1 2.713 0.485 0.715 5.595 0.000 1.705 3.722
## x2 6.164 3.186 0.247 1.935 0.067 -0.462 12.791
## ---------------------------------------------------------------------------------------
##
##
##
## No more variables to be added/removed.
##
##
## Final Model Output
## ------------------
##
## Model Summary
## -------------------------------------------------------------
## R 0.896 RMSE 2.792
## R-Squared 0.803 Coef. Var 8.067
## Adj. R-Squared 0.784 MSE 7.796
## Pred R-Squared 0.729 MAE 2.160
## -------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------
## Regression 665.332 2 332.666 42.672 0.0000
## Residual 163.714 21 7.796
## Total 829.046 23
## -------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------
## (Intercept) 10.042 2.958 3.394 0.003 3.889 16.194
## x1 2.713 0.485 0.715 5.595 0.000 1.705 3.722
## x2 6.164 3.186 0.247 1.935 0.067 -0.462 12.791
## ---------------------------------------------------------------------------------------
k
##
## Stepwise Selection Summary
## ------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## ------------------------------------------------------------------------------------
## 1 x1 addition 0.767 0.757 2.1810 124.1267 2.9610
## 2 x2 addition 0.803 0.784 0.8260 122.1907 2.7921
## ------------------------------------------------------------------------------------
plot(k)
k<- ols_step_both_aic(model_b4_full)
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1 . x1
## 2 . x2
## 3 . x3
## 4 . x4
## 5 . x5
## 6 . x6
## 7 . x7
## 8 . x8
## 9 . x9
##
##
## Variables Entered/Removed:
##
## - x1 added
## - x2 added
##
## No more variables to be added or removed.
k
##
##
## Stepwise Summary
## -----------------------------------------------------------------------------
## Variable Method AIC RSS Sum Sq R-Sq Adj. R-Sq
## -----------------------------------------------------------------------------
## x1 addition 124.127 192.891 636.156 0.76733 0.75676
## x2 addition 122.191 163.714 665.332 0.80253 0.78372
## -----------------------------------------------------------------------------
plot(k)
# The SignifReg function in the SignifReg package makes it easy to switch the selection criterion to p-value or others. Change the options accordingly before running it #
library(SignifReg)
# SignifReg(scope, data, alpha = 0.05, direction = "forward", criterion = "p-value", correction = "FDR") #
# Check it on your own later #
# Final models #
model_b4_1 <- lm(y ~ x1 + x2, data=table_b4)
summary(model_b4_1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = table_b4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7639 -1.9454 -0.1822 1.8068 5.0423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.0418 2.9585 3.394 0.00273 **
## x1 2.7134 0.4849 5.595 1.49e-05 ***
## x2 6.1643 3.1864 1.935 0.06663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.792 on 21 degrees of freedom
## Multiple R-squared: 0.8025, Adjusted R-squared: 0.7837
## F-statistic: 42.67 on 2 and 21 DF, p-value: 4.007e-08
anova(model_b4_1)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 636.16 636.16 81.6013 1.114e-08 ***
## x2 1 29.18 29.18 3.7426 0.06663 .
## Residuals 21 163.71 7.80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(model_b4_1)
## x1 x2
## 1.736466 1.736466
confint(model_b4_1, level=0.1/1) # Intended as a Bonferroni joint confidence interval; note that level=0.1 actually requests a 10% interval, not a 90% joint family #
## 45 % 55 %
## (Intercept) 9.665478 10.418052
## x1 2.651718 2.775079
## x2 5.758991 6.569541
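The call above sets level = 0.1, which requests a 10% interval (hence the 45%/55% bounds). For a joint 90% Bonferroni family over the g = 2 slope estimates, each individual interval is instead computed at level 1 - 0.10/2 = 0.95:

```r
# Bonferroni: individual level 1 - alpha/g gives joint coverage >= 1 - alpha
confint(model_b4_1, parm = c("x1", "x2"), level = 1 - 0.10/2)
```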
plot(model_b4_1, pch=16, col="blue")
#Create Partial Regression plots #
avPlots(model_b4_1)
model_b4_backward <- lm(y ~ x1 + x2 + x5 + x7, data=table_b4)
summary(model_b4_backward)
##
## Call:
## lm(formula = y ~ x1 + x2 + x5 + x7, data = table_b4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5510 -2.0807 0.0391 1.8805 3.5912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.5091 3.6331 3.718 0.001458 **
## x1 2.4192 0.5169 4.680 0.000163 ***
## x2 8.4802 3.2943 2.574 0.018577 *
## x5 2.0006 1.2104 1.653 0.114793
## x7 -2.1823 1.2762 -1.710 0.103557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.681 on 19 degrees of freedom
## Multiple R-squared: 0.8352, Adjusted R-squared: 0.8005
## F-statistic: 24.08 on 4 and 19 DF, p-value: 3.249e-07
anova(model_b4_backward)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 636.16 636.16 88.4747 1.398e-08 ***
## x2 1 29.18 29.18 4.0578 0.05834 .
## x5 1 6.08 6.08 0.8449 0.36951
## x7 1 21.02 21.02 2.9239 0.10356
## Residuals 19 136.61 7.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
vif(model_b4_backward)
## x1 x2 x5 x7
## 2.138863 2.012386 1.712819 1.661124
confint(model_b4_backward, level=0.1/3) # Intended as a Bonferroni joint confidence interval; note that level=0.1/3 actually requests a 3.3% interval; a joint 90% family over g estimates would use level = 1 - 0.1/g #
## 48.3 % 51.7 %
## (Intercept) 13.355217 13.662895
## x1 2.397272 2.441046
## x2 8.340759 8.619740
## x5 1.949333 2.051838
## x7 -2.236290 -2.128212
plot(model_b4_backward, pch=16, col="blue")
#Create Partial Regression plots #
avPlots(model_b4_backward)
deviation <- table_b4$y-mean(table_b4$y)
# Predicted R-squared = 1 - (PRESS statistic / SST)
1-((MPV::PRESS(model_b4_full))/(deviation%*%deviation)) # Compute SST by multiplying two vectors #
## [,1]
## [1,] 0.5253679
1-((MPV::PRESS(model_b4_full))/(var(table_b4$y)*(nrow(table_b4)-1)))
## [1] 0.5253679
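The PRESS statistic itself needs no refitting: for a linear model the leave-one-out residual equals e_i / (1 - h_ii), so the predicted R-squared above can be reproduced with base R alone:

```r
# PRESS from ordinary residuals and hat values (no MPV package needed)
press <- sum((resid(model_b4_full) / (1 - hatvalues(model_b4_full)))^2)
sst   <- sum((table_b4$y - mean(table_b4$y))^2)
1 - press / sst    # predicted R-squared, same value as above
```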
1 Regression
In the presence of interaction terms in the model, the predictors are scaled and centered before computing the standardized betas. ols_regress() detects interaction terms written with the inline * operator automatically, but if you have created a new variable for the interaction instead, you can indicate its presence by setting iterm to TRUE.
ols_regress(mpg ~ disp + hp + wt + qsec, data = mtcars)
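A sketch of the second case described above, where the interaction is stored as its own column rather than written with *; the variable name dw is made up for this illustration:

```r
# Interaction pre-computed as a column; 'dw' is a hypothetical name
# created here only for the example, flagged via iterm as noted above.
mt <- mtcars
mt$dw <- mt$disp * mt$wt
ols_regress(mpg ~ disp + wt + dw, data = mt, iterm = TRUE)
```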
2 Residual vs Fitted Values Plot
Plot to detect non-linearity, unequal error variances, and outliers.
ols_plot_resid_fit(model)
3 DFBETAs Panel
DFBETAs measure the difference in each parameter estimate with and without the influential observation. dfbetas_panel creates plots to detect influential observations using DFBETAs.
ols_plot_dfbetas(model)
4 Residual Fit Spread Plot
Plot to detect non-linearity, influential observations and outliers.
ols_plot_resid_fit_spread(model)
5 Breusch Pagan Test
The Breusch-Pagan test is used to test for heteroskedasticity (non-constant error variance). It tests whether the variance of the errors from a regression depends on the values of the independent variables. It is a χ2 test.
ols_test_breusch_pagan(model)
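If olsrr is not available, the same test also exists in the lmtest package as bptest(); note the two packages parameterize the auxiliary regression slightly differently, so the statistics need not match exactly:

```r
# Breusch-Pagan via lmtest (an alternative to ols_test_breusch_pagan)
library(lmtest)
bptest(model)                        # studentized version (default)
bptest(model, studentized = FALSE)   # classical version
```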
6 Collinearity Diagnostics
ols_coll_diag(model)
7 Stepwise Regression
Build a regression model from a set of candidate predictor variables by entering and removing predictors based on p-values, in a stepwise manner, until no variables are left to enter or remove.
model <- lm(y ~ ., data = surgical)
ols_step_both_p(model)
8 Stepwise AIC Backward Regression
Build a regression model from a set of candidate predictor variables by removing predictors based on the Akaike Information Criterion, in a stepwise manner, until no variables are left to remove.
k <- ols_step_backward_aic(model)
1 All Possible Regression
ols_step_all_possible(model)
plot(k)
2 Best Subset Regression
Select the subset of predictors that best meets some well-defined objective criterion, such as having the largest R2 or the smallest MSE, Mallows's Cp, or AIC.
ols_step_best_subset(model)
plot(k)
3 Stepwise Forward Regression
ols_step_forward_p(model)
plot(k)
ols_step_forward_p(model, details = TRUE)
4 Stepwise Backward Regression
ols_step_backward_p(model, details = TRUE)
5 Stepwise Regression
ols_step_both_p(model, details = TRUE)
6 Stepwise AIC Forward Regression
ols_step_forward_aic(model, details = TRUE)
7 Stepwise AIC Backward Regression
ols_step_backward_aic(model, details = TRUE)
8 Stepwise AIC Regression
ols_step_both_aic(model, details = TRUE)
The standard regression assumptions about the residuals/errors are:
- The errors are normally distributed (normality assumption).
- The errors have mean zero.
- The errors have the same but unknown variance (homoscedasticity assumption).
- The errors are independent of each other (independence assumption).
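These assumptions can also be spot-checked quickly with base R before reaching for olsrr; a minimal sketch:

```r
# Base-R spot checks of the standard residual assumptions
r <- resid(model)
shapiro.test(r)            # normality of the residuals
mean(r)                    # ~0 by construction for an lm fit with an intercept
plot(fitted(model), r)     # look for curvature or non-constant spread
abline(h = 0, lty = 2)
```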
1 Residual QQ Plot
ols_plot_resid_qq(model)
2 Residual Normality Test
Test for detecting violation of normality assumption.
ols_test_normality(model)
Correlation between observed residuals and expected residuals under normality.
ols_test_correlation(model)
3 Residual vs Fitted Values Plot
Characteristics of a well behaved residual vs fitted plot:
ols_plot_resid_fit(model)
4 Residual Histogram
Histogram of residuals for detecting violation of normality assumption.
ols_plot_resid_hist(model)
1 Bartlett Test
ols_test_bartlett(hsb, read, group_var = female)
ols_test_bartlett(hsb, read, write)
2 Breusch Pagan Test
ols_test_breusch_pagan(model)
ols_test_breusch_pagan(model, rhs = TRUE)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE)
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'bonferroni')
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'sidak')
ols_test_breusch_pagan(model, rhs = TRUE, multiple = TRUE, p.adj = 'holm')
3 Score Test
ols_test_score(model)
ols_test_score(model, rhs = TRUE)
ols_test_score(model, vars = c('disp', 'hp'))
4 F Test
ols_test_f(model)
ols_test_f(model, rhs = TRUE)
ols_test_f(model, vars = c('disp', 'hp'))
1 Cook’s D Bar Plot
Steps to compute Cook’s distance:
- Delete observations one at a time.
- Refit the regression model on the remaining n - 1 observations.
- Examine how much all of the fitted values change when the ith observation is deleted.
A data point with a large Cook’s D strongly influences the fitted values.
ols_plot_cooksd_bar(model)
2 Cook’s D Chart
ols_plot_cooksd_chart(model)
3 DFBETAs Panel
DFBETA measures the difference in each parameter estimate with and without the influential point. There is a DFBETA for each data point i.e if there are n observations and k variables, there will be n∗k DFBETAs. In general, large values of DFBETAS indicate observations that are influential in estimating a given parameter. Belsley, Kuh, and Welsch recommend 2 as a general cutoff value to indicate influential observations and \(\frac{2}{\sqrt{n}}\) as a size-adjusted cutoff.
ols_plot_dfbetas(model)
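The same quantities are available in base R as dfbetas(); a sketch applying the size-adjusted cutoff quoted above:

```r
# DFBETAS with the 2/sqrt(n) size-adjusted cutoff, in base R
db <- dfbetas(model)                       # n rows, one column per coefficient
cutoff <- 2 / sqrt(nrow(db))
which(abs(db) > cutoff, arr.ind = TRUE)    # influential (observation, coefficient) pairs
```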
4 DFFITs Plot
Steps to compute DFFITS:
- Delete observations one at a time.
- Refit the regression model on the remaining n - 1 observations.
- Examine how much the fitted value for the ith observation changes when that observation is deleted.
An observation is deemed influential if the absolute value of its DFFITS value is greater than \(2\sqrt{\frac{p + 1}{n - p - 1}}\), where n is the number of observations and p is the number of predictors, excluding the intercept.
ols_plot_dffits(model)
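Base R computes DFFITS directly; applying the threshold above:

```r
# DFFITS with the 2*sqrt((p+1)/(n-p-1)) cutoff
d <- dffits(model)
n <- length(d)
p <- length(coef(model)) - 1     # number of predictors, excluding the intercept
which(abs(d) > 2 * sqrt((p + 1) / (n - p - 1)))   # influential observations
```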
5 Studentized Residual Plot
Plot for detecting outliers. Studentized deleted residuals (also called externally studentized residuals) are the deleted residuals divided by their estimated standard deviations. Studentized residuals are more effective than standardized residuals for detecting outlying Y observations. If an observation has an externally studentized residual larger than 3 in absolute value, we call it an outlier.
ols_plot_resid_stud(model)
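The underlying values are rstudent() in base R; flagging |t| > 3 as above:

```r
# Externally studentized residuals; |t| > 3 flags outliers
t_i <- rstudent(model)
which(abs(t_i) > 3)
# (The internally studentized residuals used in the next section are rstandard(model).)
```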
6 Standardized Residual Chart
Chart for detecting outliers. A standardized residual (internally studentized) is the residual divided by its estimated standard deviation.
ols_plot_resid_stand(model)
7 Studentized Residuals vs Leverage Plot
Graph for detecting influential observations.
ols_plot_resid_lev(model)
8 Deleted Studentized Residual vs Fitted Values Plot
ols_plot_resid_stud_fit(model)
9 Hadi Plot
Hadi’s measure of influence is based on the fact that influential observations can appear in the response variable, in the predictors, or in both. The plot is used to detect influential observations based on Hadi’s measure.
ols_plot_hadi(model)
10 Potential Residual Plot
Plot to aid in classifying unusual observations as high-leverage points, outliers, or a combination of both.
ols_plot_resid_pot(model)
1 Collinearity Diagnostics
ols_vif_tol(model)
Tolerance: the percent of variance in a predictor that cannot be accounted for by the other predictors.
Most multivariate statistical approaches involve decomposing a correlation matrix into linear combinations of variables. The linear combinations are chosen so that the first combination has the largest possible variance (subject to some restrictions we won’t discuss), the second combination has the next largest variance, subject to being uncorrelated with the first, the third has the largest possible variance, subject to being uncorrelated with the first and second, and so forth. The variance of each of these linear combinations is called an eigenvalue. Collinearity is spotted by finding 2 or more variables that have large proportions of variance (.50 or more) that correspond to large condition indices. A rule of thumb is to label as large those condition indices in the range of 30 or larger.
ols_eigen_cindex(model)
ols_coll_diag(model)
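The decomposition described above can be sketched by hand; the exact numbers depend on the scaling convention (here, columns of the model matrix scaled to unit length, following Belsley, Kuh, and Welsch):

```r
# Condition indices by hand: unit-length columns, eigenvalues of X'X
X  <- model.matrix(model)
X  <- scale(X, center = FALSE, scale = sqrt(colSums(X^2)))
ev <- eigen(crossprod(X))$values
sqrt(max(ev) / ev)   # condition indices; values above ~30 suggest collinearity
```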
2 Model Fit Assessment
ols_plot_resid_fit_spread(model)
Correlations: Relative importance of independent variables in determining Y. How much each variable uniquely contributes to R2 over and above that which can be accounted for by the other predictors.
Zero Order: Pearson correlation coefficient between the dependent variable and the independent variables.
Part: Unique contribution of independent variables. How much R2 will decrease if that variable is removed from the model?
Partial: How much of the variance in Y, which is not estimated by the other independent variables in the model, is estimated by the specific variable?
ols_correlations(model)
ols_plot_obs_fit(model)
ols_pure_error_anova(model)
ols_plot_diagnostics(model)
3 Variable Contributions
Graph to determine whether we should add a new predictor to a model that already contains other predictors. The residuals from the model are regressed on the new predictor; if the plot shows a non-random pattern, you should consider adding the new predictor to the model.
ols_plot_resid_regressor(model, drat)
Regress Y on all variables other than X and store the residuals (Y residuals).
Regress X on all the other variables included in the model (X residuals).
Construct a scatter plot of Y residuals and X residuals.
ols_plot_added_variable(model)
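The three steps above follow the Frisch-Waugh-Lovell logic and can be carried out by hand; mtcars and the predictor wt are used here purely as an illustration:

```r
# Added-variable plot for 'wt' by hand
fit_y <- lm(mpg ~ disp + hp, data = mtcars)   # Y on all variables except X
fit_x <- lm(wt  ~ disp + hp, data = mtcars)   # X on the other variables
plot(resid(fit_x), resid(fit_y),
     xlab = "wt residuals", ylab = "mpg residuals")
# The least-squares slope through these points equals the coefficient of wt
# in lm(mpg ~ disp + hp + wt, data = mtcars).
abline(lm(resid(fit_y) ~ resid(fit_x)))
```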
Regress Y on all variables including X and store the residuals (e). Multiply e by the regression coefficient of X (eX). Construct a scatter plot of eX against X.
ols_plot_comp_plus_resid(model)